Use of diplotypes - matched haplotype pairs from homologous 
chromosomes - in gene-disease association studies 
Summary: Alleles, genotypes and haplotypes (combinations of alleles) have been widely used in gene-disease 
association studies. More recently, association studies using diplotypes (haplotype pairs on homologous 
chromosomes) have become increasingly common. This article reviews the rationale of the four types of 
association analyses and discusses the situations in which diplotype-based analyses are more powerful than 
the other types of association analyses. Haplotype-based association analyses are more powerful than allele- 
based association analyses, and diplotype-based association analyses are more powerful than genotype-based 
analyses. In circumstances where there are no interaction effects between markers and where the criteria 
for Hardy-Weinberg Equilibrium (HWE) are met, the larger sample size and smaller degrees of freedom of 
allele-based and haplotype-based association analyses make them more powerful than genotype-based and 
diplotype-based association analyses, respectively. However, under certain circumstances diplotype-based 
analyses are more powerful than haplotype-based analysis. 

Key words: diplotype, haplotype, association analysis, genotypes, interaction effects, Hardy-Weinberg 
equilibrium 



1. Introduction: definition and composition of diplotypes 

Humans are diploid organisms; they have paired 
homologous chromosomes in their somatic cells, 
which contain two copies of each gene. An allele is one 
member of a pair of genes occupying a specific spot 
on a chromosome (called a locus). Two alleles at the 
same locus on homologous chromosomes make up the 
individual's genotype. A haplotype (a contraction of the 
term 'haploid genotype') is a combination of alleles at 
multiple loci that are transmitted together on the same 
chromosome. A haplotype may refer to as few as two loci 
or to an entire chromosome depending on the number 
of recombination events that have occurred between a 
given set of loci. Genewise haplotypes are established 
with markers within a gene; familywise haplotypes are 
established with markers within members of a gene 
family; and regionwise haplotypes are established with 
markers within different genes in a region of the same 
chromosome. Finally, a diplotype is a matched pair of 
haplotypes on homologous chromosomes[1] (see Figure 1). 
Figure 1. Model of alleles, genotypes, haplotypes 
and diplotypes on a pair of chromosomes 




Alleles: A, B, C, a, b and c 
Genotypes: A/a; B/b and C/c 
Haplotypes: ABC and abc 
Diplotype: ABC/abc 
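
The relationships in Figure 1 can be sketched in a few lines of Python. This is only an illustration: the function name `diplotype_summary` and the one-character-per-locus string encoding are assumptions made here, not part of the article.

```python
def diplotype_summary(hap1: str, hap2: str) -> dict:
    """Derive alleles, single-locus genotypes and the diplotype from two haplotypes.

    Assumed encoding: each haplotype is a string with one character per locus.
    """
    if len(hap1) != len(hap2):
        raise ValueError("both haplotypes must cover the same loci")
    # Alleles: every distinct symbol seen on either chromosome
    alleles = sorted(set(hap1) | set(hap2))
    # Genotypes: the pair of alleles at each locus (order within a pair is arbitrary)
    genotypes = ["/".join(sorted(pair)) for pair in zip(hap1, hap2)]
    # Diplotype: the unordered pair of haplotypes
    diplotype = "/".join(sorted([hap1, hap2]))
    return {"alleles": alleles, "genotypes": genotypes, "diplotype": diplotype}


figure_1 = diplotype_summary("ABC", "abc")
other_phase = diplotype_summary("ABc", "abC")
```

`figure_1` reproduces the example in Figure 1. `other_phase` has the same three single-locus genotypes but a different diplotype, which is why diplotypes often have to be inferred when only genotypes are observed.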



A full-text Chinese translation of this article will be available at www.saponline.org on July 25, 2014. 
Traditionally, the expectation-maximization (EM) 
algorithm has been used to estimate haplotype 
frequencies.[2,3] This algorithm assumes Hardy-Weinberg 
Equilibrium (HWE).[4] However, if the genotype frequency 
distributions of individual markers are not in HWE, the 
assumption of the EM algorithm will be violated. The 
magnitude of the error of the EM estimates is greater 
when the HWE violation (the so-called Hardy-Weinberg 
Disequilibrium [HWD]) is attributable to a deficit of 
heterozygotes, that is, an expected heterozygote 
frequency greater than the observed heterozygote 
frequency.[4] 
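
To make the role of the HWE assumption concrete, the following is a minimal sketch of EM haplotype-frequency estimation for the simplest case of two biallelic markers. It is not the implementation used by any of the programs cited in this article; the function names, the 0/1/2 genotype coding and the uniform starting values are assumptions chosen for illustration.

```python
from collections import Counter

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def split_genotype(g):
    """Alleles (0 or 1) carried by the two chromosomes for a genotype g in {0, 1, 2}."""
    return g // 2, (g + 1) // 2


def em_haplotype_frequencies(genotypes, max_iter=200, tol=1e-10):
    """Estimate two-locus haplotype frequencies from unphased genotypes by EM.

    genotypes: iterable of (g1, g2), each the count (0, 1 or 2) of allele '1' at a locus.
    """
    counts = Counter(genotypes)
    n_chromosomes = 2 * sum(counts.values())
    if n_chromosomes == 0:
        raise ValueError("no genotypes supplied")
    freq = {h: 0.25 for h in HAPLOTYPES}  # uniform starting values (an assumption)
    for _ in range(max_iter):
        expected = {h: 0.0 for h in HAPLOTYPES}
        for (g1, g2), n in counts.items():
            if g1 == 1 and g2 == 1:
                # E-step: only double heterozygotes are phase-ambiguous. Under HWE the
                # two possible diplotypes are weighted by products of haplotype frequencies.
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                expected[(0, 0)] += n * w
                expected[(1, 1)] += n * w
                expected[(0, 1)] += n * (1 - w)
                expected[(1, 0)] += n * (1 - w)
            else:
                # Phase is known: count the two haplotypes directly
                a1, a2 = split_genotype(g1)
                b1, b2 = split_genotype(g2)
                expected[(a1, b1)] += n
                expected[(a2, b2)] += n
        # M-step: new frequencies are the expected haplotype counts per chromosome
        new_freq = {h: expected[h] / n_chromosomes for h in HAPLOTYPES}
        change = max(abs(new_freq[h] - freq[h]) for h in HAPLOTYPES)
        freq = new_freq
        if change < tol:
            break
    return freq
```

The E-step weights are products of haplotype frequencies, which is exactly where HWE enters; if genotype frequencies depart from HWE, these weights no longer describe the population and the estimates can be biased.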

Several programs can be used to construct both 
haplotypes and diplotypes. The HelixTree program[5] is 
based on the EM algorithm. New-generation programs 
such as the PHASE program are based on the Bayesian 
approach and the Partition Ligation algorithm; their 
proponents claim that they are more accurate in 
constructing haplotypes than the traditional programs 
based on the EM algorithm.[6-8] Both HelixTree 
and PHASE can estimate the diplotype frequency 
distributions among a population and estimate 
the diplotype probabilities for each individual. The 
probabilities of unambiguously observed diplotypes for 
each individual estimated by these programs should 
be 1.0; the probabilities of inferred diplotypes for each 
subject will be between 0.0 and 1.0. 
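
The distinction between unambiguously observed and inferred diplotypes can be illustrated with a small sketch. It assumes biallelic markers coded 0/1/2, haplotype frequencies that are already known, and HWE; the function names and the example frequencies are hypothetical and do not reproduce the internals of HelixTree or PHASE.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """List the unordered haplotype pairs consistent with a multi-locus genotype.

    genotype: tuple of 0/1/2 counts of allele '1' at each biallelic locus.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # allele shared by both chromosomes at homozygous loci
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b  # heterozygous locus: the chromosomes differ
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def diplotype_probabilities(genotype, hap_freq):
    """Probability of each compatible diplotype for one individual, assuming HWE."""
    weights = {}
    for h1, h2 in compatible_diplotypes(genotype):
        w = hap_freq.get(h1, 0.0) * hap_freq.get(h2, 0.0)
        # HWE frequency of an unordered pair: p^2 if identical, 2pq otherwise
        weights[(h1, h2)] = w if h1 == h2 else 2 * w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("genotype is impossible under the supplied haplotype frequencies")
    return {d: w / total for d, w in weights.items()}


# Hypothetical haplotype frequencies for two markers
hap_freq = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
unambiguous = diplotype_probabilities((0, 1), hap_freq)  # heterozygous at one marker only
ambiguous = diplotype_probabilities((1, 1), hap_freq)    # heterozygous at both markers
```

An individual who is heterozygous at no more than one marker has a single compatible diplotype with probability 1.0, whereas a double heterozygote has two candidate diplotypes whose probabilities lie strictly between 0.0 and 1.0.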

2. Diplotype-based association analysis: application 
and interpretation 

Haplotype-based and diplotype-based association 
analyses are more powerful than allele-based 
and genotype-based analyses.[9-11] Under certain 
circumstances (reviewed below), diplotype-based 
analysis is more powerful than haplotype-based 
analysis. Under these specific circumstances, 
diplotype-based association analysis is the most powerful of the 
four types of association analyses, a finding that has 
been confirmed in about 200 studies since 2002.[12,13] 
For example, Lee and colleagues[14] found that the 111 
haplotype of the Calpain-10 gene was associated with 
an increased risk of polycystic ovary syndrome (PCOS) 
(OR=2.4; 95% CI 1.8-3.3), the 112 haplotype was 
associated with a decreased risk of PCOS (OR=0.6; 95% 
CI 0.4-0.8), and the 121 haplotype was not associated 
with PCOS; however, the 111/121 diplotype was more 
strongly associated with increased susceptibility to PCOS 
than any of the haplotypes (OR=3.4; 95% CI 2.2-5.2). 
Luo and colleagues[15-22] reported that the diplotypes 
at ADH1A, 1B, 1C, 4 and 7, CHRM2, OPRM1, OPRD1 
and OPRK1 were much more strongly associated with 
alcohol dependence, drug dependence and personality 
factors than the alleles, genotypes and haplotypes at 
these sites. Li and colleagues[23] found that specific 
growth traits were significantly associated with the 
diplotypes of four individual SNPs at IGF-II but not with 
the haplotypes of these SNPs. Similar findings have 
been reported in other studies.[24,25] 

There are several possible interpretations of these 
findings: 



2.1 Haplotypes and diplotypes contain more 
information than alleles and genotypes 

As shown in Figure 1, a haplotype is a combination of 
alleles from multiple loci on a single chromosome, a 
genotype is composed of two alleles on homologous 
chromosomes, and a diplotype is composed of two 
haplotypes (i.e., multiple genotypes) on homologous 
chromosomes. Theoretically, the information contained 
in a multi-locus haplotype is greater than that in a 
single-locus allele and the information contained in a 
multi-locus diplotype is greater than that contained in a 
single-locus genotype. Similarly, haplotypes with more 
alleles contain more information than those with fewer 
alleles and diplotypes with more genotypes contain 
more information than those with fewer genotypes. 

A multi-locus haplotype is a specific variant of all 
possible combinations of single-locus alleles on the 
chromosome; both alleles and haplotypes reflect the 
features of chromosomes in the population. A diplotype 
is a specific variant of all possible combinations of 
single-locus genotypes on the paired chromosomes; 
both genotypes and diplotypes represent the types 
of chromosome pairs in each individual (see Table 1). 
A diplotype can also be conceptualized as a specific 
variant of all possible combinations of haplotypes from 
the two participating chromosomes. So haplotype- 
based analyses are equivalent to a stratified analysis of 
all alleles (at all loci), and diplotype-based analyses are 
equivalent to both stratified analysis of all genotypes at 
all loci, and to stratified analysis of all haplotypes. Thus, 
when the sample size is sufficiently large, haplotype- 
and diplotype-based analyses should be more powerful 
than allele-based and genotype-based analyses. 
Similarly, the analysis of an individual diplotype should 
be more informative than analysis of the corresponding 
individual haplotype. 

Two alleles at one biallelic marker can divide the 
chromosomes in a population into two categories; 
these two alleles would result in three genotypes at 
the specified marker on homologous chromosomes 
and, thus, could be used to divide the individuals 
in a population into three categories. Assuming n 
independent biallelic markers, up to $2^n$ haplotypes 
constructed by these n markers can divide the 
chromosomes in a population into $2^n$ categories. At 
the same time, n independent biallelic markers would 
result in up to $2^n(2^n+1)/2$ diplotypes on the paired 
chromosomes, dividing the individuals in a population 
into $2^n(2^n+1)/2$ categories. (Note: each of these 
$2^n(2^n+1)/2$ diplotype categories is a subset of one 
of the $2^n$ haplotype categories.) When the sample 
size is large enough, dividing a sample into more 
categories increases the ability to identify meaningful 
variance between different subgroups in the sample, 
so haplotype-based and diplotype-based analyses 
are more powerful than allele-based and genotype- 
based analyses and an individual's diplotype is more 
informative than an individual's haplotype. However, 
the overall diplotype-based analysis may not be more 
powerful than the corresponding haplotype-based 
analysis because in some situations the much greater 
number of degrees of freedom in a diplotype-based 
analysis than in the corresponding haplotype-based 
analysis weakens the strength of the identified 
associations. 

Table 1. Comparison of haplotype-based and diplotype-based association analyses 

| | Haplotype-based association analysis | Diplotype-based association analysis |
|---|---|---|
| Composition | A haplotype is a subset of all alleles on specific chromosomes in the population. | A diplotype is a subset of all genotypes on homologous chromosome pairs in the population. A specific diplotype is one variant of all possible combinations of the haplotypes that exist in the population. |
| Feature | Both alleles and haplotypes reflect the components of chromosomes in individuals and in the population. | Both genotypes and diplotypes reflect the components of chromosome pairs in individuals and in the population. |
| n independent single-nucleotide polymorphisms (SNPs) | At most $2^n$ haplotypes | At most $2^n(2^n+1)/2$ diplotypes |
| Degrees of freedom in analysis | $2^n-1$ | $[2^n(2^n+1)/2]-1$ |
| Markers not in Hardy-Weinberg Equilibrium (HWE) | Less powerful predictor of disease status | More powerful predictor of disease status |
| Recessive genetic model | Less powerful predictor of disease status | More powerful predictor of disease status |
| With interaction | Less powerful predictor of disease status | More powerful predictor of disease status |
| Without interaction | More powerful predictor of disease status | Less powerful predictor of disease status |
| Sample size (n individuals) | 2n | n |
| Frequency of rare categories | Less common | More common (decreases power) |
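
The category counts and degrees of freedom discussed in this section can be checked by direct enumeration. The sketch below assumes biallelic markers coded as the characters "0" and "1"; the function names are hypothetical.

```python
from itertools import combinations_with_replacement, product


def max_haplotypes(n_markers: int) -> int:
    """Upper bound on distinct haplotypes for n biallelic markers: 2^n."""
    return 2 ** n_markers


def max_diplotypes(n_markers: int) -> int:
    """Upper bound on distinct diplotypes (unordered haplotype pairs): 2^n (2^n + 1) / 2."""
    h = max_haplotypes(n_markers)
    return h * (h + 1) // 2


def enumerate_diplotypes(n_markers: int):
    """Explicitly list every unordered pair of haplotypes, including identical pairs."""
    haplotypes = ["".join(bits) for bits in product("01", repeat=n_markers)]
    return list(combinations_with_replacement(haplotypes, 2))


for n in (1, 2, 3, 4):
    h, d = max_haplotypes(n), max_diplotypes(n)
    # Degrees of freedom are one less than the number of categories
    print(f"n={n}: haplotypes={h} (df={h - 1}), diplotypes={d} (df={d - 1})")
```

With two markers there are at most 4 haplotypes and 10 diplotypes; with four markers the bounds are 16 and 136, which shows how quickly the degrees of freedom of a diplotype-based analysis outgrow those of the corresponding haplotype-based analysis.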

The multi-locus haplotype and diplotype are 
composed of multiple markers that are in linkage 
disequilibrium (LD). They contain information from all 
of these individual markers and from several unknown 
flanking markers on the same chromosome. They 
are, therefore, usually more informative and closer to 
representing a 'whole gene' than single-marker alleles 
and genotypes. This is particularly the case when several 
of the known and unknown markers are etiologically 
related to the disease(s) of interest.[9-11] 

2.2 Genotype-based and diplotype-based analyses 
remain valid in the presence of Hardy-Weinberg 
Disequilibrium 

When the genotype frequency distributions of some 
markers are not in Hardy-Weinberg Equilibrium, the 
allele-based and haplotype-based analyses become 
less powerful and may be invalid, but the 
genotype-based and diplotype-based analyses are still valid. 
When there is Hardy-Weinberg Disequilibrium, the 
marker alleles and haplotypes are not independent 
of each other, so the effects of disease-predisposing 
alleles and haplotypes may be 'masked' by other 
non-disease-predisposing alleles and haplotypes or, in 
the case of a recessive condition, by the presence of a 
dominant allele on the homologous chromosome. This 
weakens or invalidates the association 
between the allele or haplotype and the disease(s) of 
interest. However, genotype-based and diplotype-based 
association analyses remain valid even in the presence 
of strong Hardy-Weinberg Disequilibrium. This has been 
demonstrated in several studies.[15-18,27-30] 
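
Whether a marker departs from HWE is commonly checked with a goodness-of-fit chi-square test on the genotype counts. The sketch below is one standard way to do this for a single biallelic marker; it is not taken from the cited studies, and the function name is hypothetical.

```python
import math


def hwe_chi_square(n_AA: int, n_Aa: int, n_aa: int):
    """Goodness-of-fit test of Hardy-Weinberg Equilibrium for one biallelic marker.

    Returns the chi-square statistic and its p-value on 1 degree of freedom.
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)  # HWE expectations
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


in_hwe = hwe_chi_square(25, 50, 25)               # hypothetical counts matching HWE exactly
heterozygote_deficit = hwe_chi_square(40, 20, 40)  # hypothetical counts with too few heterozygotes
```

A small p-value indicates Hardy-Weinberg Disequilibrium, the situation in which this section argues that genotype-based and diplotype-based analyses are the safer choice.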



2.3 Haplotype and diplotype analyses incorporate 
interaction effects and, thus, are more informative 
when interaction between assessed markers is 
present 

Haplotypes and diplotypes incorporate information 
on linkage disequilibrium among markers, so 
information on the multivariate interaction effects 
between markers is incorporated into 
haplotype-based and diplotype-based analyses.[31] In most 
cases,[18,20-22] reported interaction effects between alleles 
and between genotypes are similar to those seen 
with corresponding multi-locus haplotype-based and 
diplotype-based analyses; this supports the contention 
that diplotype-based analyses incorporate information 
on the interactions between different markers and 
between different haplotypes. The interaction effect is 
often a more powerful predictor of disease status than 
the main effect, especially when the main effects 
are marginal,[33] so when interaction effects occur 
diplotype-based association analyses would likely be 
more informative than association analyses based on 
haplotypes, genotypes or alleles. 

2.4 Using quantitative measures instead of categorical 
measures makes diplotype-based analysis more 
powerful 

Programs implementing the Bayesian approach can 
estimate the probabilities of all possible pairs of 
haplotypes (i.e., a 'full model' in which the probabilities 
of all diplotype categories are assessed) or the 
probabilities of the most relevant subset of diplotype 
categories (i.e., a 'reduced model') for each individual. 
The estimated diplotype probabilities are quantitative 
measures so they usually preserve more information 
than the original categorical list of the different 
diplotype categories. Thus the analyses are more 
powerful if they employ diplotype probabilities instead 
of diplotype categories. 
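
The difference between the two codings can be sketched as follows. The diplotype labels, the probabilities and the function name `design_rows` are hypothetical; in practice the rows would be used as predictors in whatever regression model the analysis employs.

```python
def design_rows(diplotype_probs, categories, use_probabilities=True):
    """Build one predictor row per individual.

    diplotype_probs: list of dicts mapping diplotype label -> estimated probability.
    categories: the diplotype categories kept in the model (full or reduced).
    """
    rows = []
    for probs in diplotype_probs:
        if use_probabilities:
            # Quantitative coding: keep the estimated probability of each category
            rows.append([probs.get(c, 0.0) for c in categories])
        else:
            # Categorical coding: assign the single most probable diplotype
            best = max(probs, key=probs.get)
            rows.append([1.0 if c == best else 0.0 for c in categories])
    return rows


categories = ["ABC/ABC", "ABC/abc", "ABc/abC"]
individuals = [
    {"ABC/ABC": 1.0},                  # unambiguously observed diplotype
    {"ABC/abc": 0.9, "ABc/abC": 0.1},  # inferred diplotype with residual uncertainty
]
quantitative = design_rows(individuals, categories)
categorical = design_rows(individuals, categories, use_probabilities=False)
```

For the second individual the categorical coding discards the 0.1 probability of the alternative phase, whereas the quantitative coding carries that uncertainty into the analysis.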

2.5 Avoiding multiple testing preserves the power of 
haplotype-based and diplotype-based analyses 

When testing the association between single markers 
and a phenotype, multiple independent tests are 
required so the analysis needs to be adjusted for 
multiple testing, which reduces the power of the analysis 
to identify significant differences between groups. But 
there is no need to adjust for multiple testing when 
incorporating multiple markers into haplotype-based or 
diplotype-based analyses, preserving the power of the 
analysis.[34] This is another reason that haplotype-based 
and diplotype-based association analyses are more 
powerful than single-locus analyses. 
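
As a rough illustration of the cost of multiple testing, the sketch below compares per-test significance thresholds. It assumes a Bonferroni correction, which the article does not specify, and the number of markers is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


single_marker = bonferroni_threshold(0.05, 5)  # five separate single-marker tests
joint_test = bonferroni_threshold(0.05, 1)     # one haplotype- or diplotype-based test
```

Each single-marker test must meet a threshold five times stricter than the one joint test, which is the loss of power the paragraph above describes.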

3. Discussion: conclusion and future aspects 

This review shows that haplotype-based association 
analyses are more powerful than allele-based 
association analyses and that diplotype-based 
association analyses are more powerful than genotype- 
based analyses. Moreover, under certain circumstances, 
diplotype-based analyses are more powerful than 
haplotype-based analyses. Thus, in circumstances where 
very large sample sizes are available, diplotype-based 
association analysis is the most powerful of the four 
potential analytic strategies. 

The sample sizes of association analyses based 
on alleles and haplotypes are twice those of the 
corresponding association analyses based on genotypes 
and diplotypes, because each individual contributes 
two chromosomes. In addition, the degrees of freedom in 
allele-based and haplotype-based analyses are far fewer 
than the degrees of freedom of the corresponding 
genotype-based and diplotype-based analyses. Thus in 
circumstances where there are no interaction effects 
between markers and where the criteria for 
Hardy-Weinberg Equilibrium are met, allele-based association 
analyses are more powerful than genotype-based 
analyses and haplotype-based association analyses 
are more powerful than diplotype-based analyses.[9,33] 
However, in several other circumstances the 
diplotype-based analysis is more powerful than 
haplotype-based analyses: (a) when there are interaction effects 
between haplotypes, (b) when there is Hardy-Weinberg 
Disequilibrium, and (c) when considering a recessive 
model of inheritance.[33] 
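
The effect of sample size and degrees of freedom can be seen by running an allele-based and a genotype-based test on the same data. The sketch below uses Pearson's chi-square test on hypothetical case-control counts for a single biallelic marker; the function names are illustrative, and the allele-based test is only appropriate when HWE holds.

```python
import math


def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def chi2_p_value(chi2, df):
    """Upper-tail chi-square probability; closed forms for 1 and 2 df only."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        return math.exp(-chi2 / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


def allele_table(genotype_table):
    """Convert rows of (n_AA, n_Aa, n_aa) individuals into (A, a) chromosome counts."""
    return [[2 * n_AA + n_Aa, 2 * n_aa + n_Aa] for n_AA, n_Aa, n_aa in genotype_table]


# Hypothetical counts of AA, Aa and aa individuals
genotype_counts = [[30, 50, 20],   # cases
                   [20, 50, 30]]   # controls
allele_counts = allele_table(genotype_counts)

genotype_chi2, genotype_df = pearson_chi2(genotype_counts)
allele_chi2, allele_df = pearson_chi2(allele_counts)
genotype_p = chi2_p_value(genotype_chi2, genotype_df)
allele_p = chi2_p_value(allele_chi2, allele_df)
```

The allele table counts 400 chromosomes from the 200 individuals and is tested on 1 degree of freedom rather than 2, so for these counts the allele-based test gives the smaller p-value.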

One disadvantage of diplotype-based analysis 
compared to haplotype-based analysis is that there are 
typically a greater number of rare diplotype categories 
(i.e., categories with few individuals) than the number of 
rare haplotype categories. For each category, no matter 
how small, an additional degree of freedom needs to 
be included in the analysis, so this results in a greater 
decrease in the power of diplotype-based association 
tests compared to haplotype-based association tests. 
Strategies to deal with rare observations include 
excluding such categories or merging them with other 
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